1 Introduction

1.1 Synthetic data

In this replication package we use empirical data that is under an NDA. However, one can use the provided synthetic data to evaluate our approach and reach the same conclusions.

Here we show how to load the synthetic dataset (from the data/ directory in the GitHub repository), which can then be used in the analysis.

d <- readRDS("data/data.rds")

The line above is currently not executed, since we will continue to run this analysis with the empirical data. However, by executing the line above, and skipping the read.xlsx() call below, the analysis can be run as-is with the synthetic data.

1.2 Data cleaning

First we prepare the data, check for NAs, and look at some descriptive statistics. Since the empirical data is stored in Excel, we should be very careful when loading it to verify that nothing goes wrong (values silently altered, incompatible data types, etc.)

# Make sure to execute the previous statement, and not this one, if you want to 
# re-run the analysis
d <- read.xlsx("data/Features.xlsx", sheet = "Features")
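As a guard against the Excel pitfalls mentioned above, one can add a few cheap sanity checks right after loading. A minimal sketch with toy data (the specific expectations here are hypothetical, not part of the original analysis):

```r
# Minimal post-load sanity checks (toy data, hypothetical expectations):
# catch silent coercion or duplicated keys before any analysis happens.
d_toy <- data.frame(ID = 1:3, Team.priority = c(88, 83, 957))
stopifnot(
  nrow(d_toy) > 0,                   # data actually loaded
  is.numeric(d_toy$Team.priority),   # no silent coercion to character
  !any(duplicated(d_toy$ID))         # IDs should be unique
)
```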

This is what the empirical data looks like: in short, 11110 rows and 10 variables.

str(d)
## 'data.frame':    11110 obs. of  10 variables:
##  $ ID                    : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ State                 : chr  "Elicited, Prio, Dropped" "Elicited, Prio, Dropped" "Elicited, Prio, Planned, Implemented, Tested, Released" "Elicited, Prio, Dropped" ...
##  $ Team.priority         : num  88 83 957 79 88 76 74 73 72 71 ...
##  $ Critical.feature      : chr  "No" "No" "No" "No" ...
##  $ Business.value        : chr  "No value" "No value" "No value" "No value" ...
##  $ Customer.value        : chr  "No value" "No value" "No value" "No value" ...
##  $ Stakeholders          : num  1 1 0 0 0 1 1 1 1 1 ...
##  $ Key.customers         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Dependency            : chr  "No" "No" "No" "No" ...
##  $ Architects.involvement: chr  "None" "None" "None" "None" ...

Each row (requirement) has a unique ID. The State, which is our outcome (dependent variable), shows how far the feature survived in the requirements process. It is of an ordered categorical type, which should be modeled with some type of Cumulative likelihood. There are six categories:

  1. Elicited, Dropped
  2. Elicited, Prio, Dropped
  3. Elicited, Prio, Planned, Dropped
  4. Elicited, Prio, Planned, Implemented, Dropped
  5. Elicited, Prio, Planned, Implemented, Tested, Dropped
  6. Elicited, Prio, Planned, Implemented, Tested, Released
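To make the Cumulative likelihood mentioned above concrete, here is a minimal base-R sketch (illustrative numbers only): a latent continuous score is cut at \(K-1\) thresholds into \(K\) ordered categories.

```r
# Sketch of a cumulative likelihood: P(Y <= k) = logistic(tau_k - eta),
# where tau are the cutpoints and eta the linear predictor.
# Illustrative numbers only; plogis() is base R's inverse logit.
tau <- c(-2, -1, 0, 1, 2)   # 5 cutpoints for 6 categories
eta <- 0.5                  # linear predictor for one hypothetical feature
cum <- plogis(tau - eta)    # P(Y <= k) for k = 1..5
p   <- diff(c(0, cum, 1))   # per-category probabilities P(Y = k), k = 1..6
sum(p)                      # 1
```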

Team.priority is the relative priority the feature got, an integer in \(\{0,\ldots,1000\}\). Critical.feature is a simple ‘Yes’/‘No’ answer (\(\mathbb{Z}_2\)). Business.value and Customer.value are also ordered categorical, with four levels; the lowest level, ‘No value,’ does not imply missingness:

  1. No value
  2. Valuable
  3. Important
  4. Critical

Stakeholders takes integer values in \(\{0,\ldots,10\}\), and Key.customers likewise, but over a different range, i.e., \(\{0,\ldots,60\}\).

Finally, Dependency is \(\mathbb{Z}_2\), while Architects.involvement is ordered categorical:

  1. None
  2. Simple
  3. Monitoring
  4. Active Participation
  5. Joint Design

All ordered categorical predictors (independent variables) can be modeled as monotonic or category-specific effects, if necessary (Bürkner and Charpentier, 2020).
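As a rough sketch of what a monotonic effect does (illustrative numbers only, following our understanding of the brms-style parameterization): the total effect is split over the steps between categories by a simplex, so the steps need not be equidistant.

```r
# Monotonic effect sketch: effect at category c is b * D * sum(zeta[1:(c-1)]),
# with b the average per-step effect and zeta a simplex over the D steps.
# Illustrative numbers only, not estimates from our models.
b    <- 0.6                    # average effect per category step
zeta <- c(0.5, 0.3, 0.2)       # simplex over D = 3 steps (4 categories)
D    <- length(zeta)
effect <- b * D * cumsum(c(0, zeta))  # effect at categories 1..4
diff(effect)                   # unequal step sizes: 0.9, 0.54, 0.36
```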

table(is.na(d))
## 
##  FALSE 
## 111100

No NAs in the dataset. However, that doesn’t necessarily mean there are none: some codings can be representations of NA, e.g., ‘No value.’ In this particular case we know that ‘No value’ and ‘None’ in the dataset actually are values and not representations of NA.
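For completeness, a per-column variant of the NA check (toy data) that, unlike table(is.na(d)), shows in which columns any NAs would hide:

```r
# Per-column NA counts on toy data; with the real data one would run
# colSums(is.na(d)) instead.
d_toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", "z"))
colSums(is.na(d_toy))  # a: 1, b: 0
```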

Finally, we set correct data types.

# ordered categorical
d$State <- factor(d$State, 
                  levels = c("Elicited, Dropped", 
                             "Elicited, Prio, Dropped", 
                             "Elicited, Prio, Planned, Dropped", 
                             "Elicited, Prio, Planned, Implemented, Dropped", 
                             "Elicited, Prio, Planned, Implemented, Tested, Dropped", 
                             "Elicited, Prio, Planned, Implemented, Tested, Released"), 
                  ordered = TRUE)

# make sure we have integers for the other ordered categorical
d$Business.value <- factor(d$Business.value, 
                           levels = c("No value",
                                      "Valuable",
                                      "Important",
                                      "Critical"), 
                           ordered = TRUE)


d$Customer.value <- factor(d$Customer.value, 
                           levels = c("No value",
                                      "Valuable",
                                      "Important",
                                      "Critical"), 
                           ordered = TRUE)

d$Architects.involvement <- factor(d$Architects.involvement,
                                   levels = c("None",
                                              "Simple",
                                              "Monitoring",
                                              "Active Participation",
                                              "Joint Design"), 
                                   ordered = TRUE)

1.3 Descriptive statistics

First, we see that for State approximately as many features are released (final stage) as are dropped in the first stage. We also see that the counts drop off after the initial stage.

For Team.priority many features have zero in priority (\(5139\)), and then there’s a bunch of them (\(1516\)) that have priority set to the maximum value, i.e., \(1000\).

For Critical.feature we have a clear emphasis on ‘No.’

Concerning Business.value and Customer.value they are fairly similar in their respective distribution (as one would expect).

For Stakeholder and Key.customers we see a strong emphasis on lower numbers, while for Dependency a clear emphasis on ‘No.’

Finally, for Architects.involvement one can see that in the absolute majority of the cases architects are not involved.

In short, the data looks more or less as one would expect, i.e., it’s not hard to explain why the plots look the way they do.

However, before we continue, we should standardize some of our predictors to make sampling easier, i.e., we simply compute \((x - \bar{x})/\sigma_x\); multiplying by \(\sigma_x\) and adding back the mean later allows us to return to the original scale.

# standardize and abbreviated names and change types if need be
d$prio_s <- scale(d$Team.priority)
d$sh_s <- scale(d$Stakeholders)
d$kc_s <- scale(d$Key.customers)

d$b_val <- as.integer(d$Business.value) # use int as input
d$c_val <- as.integer(d$Customer.value)
d$arch <- as.integer(d$Architects.involvement)

# Dichotomous predictors. We can set these to 1/0, but generally speaking
# one should know what one is doing and be careful doing this!
d$Critical.feature <- ifelse(d$Critical.feature == 'Yes', 1, 0)
d$Dependency <- ifelse(d$Dependency == 'Yes', 1, 0)

# only abbreviate names
d$crit <- d$Critical.feature
d$dep <- d$Dependency
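A quick round-trip check (toy numbers, base R only) that the back-transform described above recovers the original scale:

```r
# Verify that x_s * sd(x) + mean(x) undoes scale(); toy values standing in
# for Team.priority.
x      <- c(0, 100, 250, 1000)   # hypothetical priority values
x_s    <- as.numeric(scale(x))   # (x - mean(x)) / sd(x)
x_back <- x_s * sd(x) + mean(x)  # multiply by sigma_x, add the mean back
all.equal(x_back, x)             # TRUE
```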

2 Model design

2.1 \(\mathcal{M}_0\)

Since our outcome is of an ordered categorical nature we have many options at our hands, some of which barely existed in Bayesian modeling a few decades ago. Our outcome reminds us of a survival model, i.e., a feature needs to survive in order to reach the next stage. Taking this into account we assume that the following type of models could be an option (Bürkner and Vuorre, 2019):

  • Cumulative (single underlying continuous variable) (Samejima, 1997). A model with no predictors (\(\mathcal{M}_0\)), with predictors (\(\mathcal{M}_1\)), and with monotonic predictors (\(\mathcal{M}_2\)).
  • Adjacent-category (mathematically convenient) (Agresti, 2010). A model with predictors (\(\mathcal{M}_3\)).
  • Sequential model (higher response category is possible only after all lower categories are achieved) (Tutz, 1990). A model with predictors (\(\mathcal{M}_4\)) and with category-specific predictors (\(\mathcal{M}_5\)).

The reason we want to use monotonic or category-specific modeling of predictors is that predictor categories will not be assumed to be equidistant with respect to their effect on the outcome (Bürkner and Charpentier, 2020).

The Cumulative family is very common, so let us assume a null model, \(\mathcal{M}_0\), which uses this likelihood with no predictors. Later we will compare all our models to \(\mathcal{M}_0\) to ensure that adding predictors improves out-of-sample predictions, i.e., if we can’t improve by adding predictors we might as well spend our time on other things.

2.1.1 Prior predictive checks

Let us see what priors such a null model needs.

(p <- get_prior(State ~ 1, 
               family = cumulative, 
               data = d))
##                 prior     class coef group resp dpar nlpar bound       source
##  student_t(3, 0, 2.5) Intercept                                       default
##  student_t(3, 0, 2.5) Intercept    1                             (vectorized)
##  student_t(3, 0, 2.5) Intercept    2                             (vectorized)
##  student_t(3, 0, 2.5) Intercept    3                             (vectorized)
##  student_t(3, 0, 2.5) Intercept    4                             (vectorized)
##  student_t(3, 0, 2.5) Intercept    5                             (vectorized)
# Set a wide Normal(0,2) on the intercepts (cutpoints for our 
# outcome, which has 6 levels, i.e., 5 cutpoints)
p$prior[1] <- "normal(0,2)"

Sample only from the priors.

# simplest cumulative model we can think of
M0 <- brm(State ~ 1, 
          family = cumulative, 
          data = d, 
          # threads = threading(4), # use if 16 CPUs
          prior = p,
          control = list(adapt_delta=0.9),
          sample_prior = "only",
          refresh = 0) # avoid printing sampling progress
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.1 seconds.
## Chain 4 finished in 0.1 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.5 seconds.

Plot the priors \(y_{\mathrm{rep}}\) vs. the empirical data \(y\).

pp_check(M0, type = "bars", nsamples = 250)

Evidently the medians are quite evenly set along the \(x\)-axis, and the uncertainty is fairly uniformly distributed among the categories \(1,\ldots,6\) (the bars). In short, this is what we like to see.

2.1.2 Sample with data

M0 <- brm(State ~ 1, 
          family = cumulative, 
          data = d, 
          # threads = threading(4),
          prior = p,
          control = list(adapt_delta=0.9),
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 1 finished in 135.1 seconds.
## Chain 3 finished in 135.2 seconds.
## Chain 2 finished in 137.6 seconds.
## Chain 4 finished in 145.3 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 138.3 seconds.
## Total execution time: 145.6 seconds.

2.1.3 Diagnostics

Our caterpillar plots look good for all parameters the model estimated (i.e., they look like fat caterpillars when the four chains have mixed well).

Diagnostics such as divergences, tree depth, and energy look good. Additionally, our \(\widehat{R}\) and effective sample size (ESS) look good.

# Check divergences, tree depth, energy
rstan::check_hmc_diagnostics(M0$fit)
## 
## Divergences:
## 0 of 4000 iterations ended with a divergence.
## 
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 10.
## 
## Energy:
## E-BFMI indicated no pathological behavior.
# Check rhat and ESS
if(max(rhat(M0), na.rm=T) >= 1.01) {
  print("Warning: Rhat >=1.01")
} else {
  print("All Rhat <1.01")
}
## [1] "All Rhat <1.01"
if(min(neff_ratio(M0), na.rm=T) <= 0.2) {
  print("Warning: ESS <=0.2")
} else {
  print("All ESS >0.2")
}
## [1] "All ESS >0.2"

2.1.4 Posterior predictive check

Let’s look at a posterior predictive plot to see how well the model has estimated the \(6\) levels of our outcome.

pp_check(M0, type = "bars", nsamples = 250)

In short, very little uncertainty (i.e., the medians are quite well estimated). Let’s leave \(\mathcal{M}_0\) for now and add predictors to the next model.

2.2 \(\mathcal{M}_1\)

Here we’ll design a Cumulative model with predictors.

For this and the coming models we won’t report on all the steps since it will take up too much space. However, rest assured, we have conducted all the steps, just as we did for \(\mathcal{M}_0\).

2.2.1 Prior predictive checks

Let’s set sane priors that are uniform on the outcome space.

p <- get_prior(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
               family = cumulative, 
               data = d)

# Set N(0,1) on \betas and N(0,2) on the cutpoints
p$prior[1] <- "normal(0,1)"
p$prior[10] <- "normal(0,2)"

Sample only from the priors.

M1 <- brm(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
          family = cumulative, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          control = list(adapt_delta=0.9),
          sample_prior = "only",
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.1 seconds.
## Chain 4 finished in 0.1 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.5 seconds.

Plot the priors \(y_{\mathrm{rep}}\) vs. the empirical data \(y\).

pp_check(M1, type = "bars", nsamples = 250)

2.2.2 Sample with data

M1 <- brm(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
          family = cumulative, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          control = list(adapt_delta=0.9),
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 3 finished in 192.6 seconds.
## Chain 1 finished in 192.9 seconds.
## Chain 2 finished in 193.2 seconds.
## Chain 4 finished in 199.5 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 194.6 seconds.
## Total execution time: 199.9 seconds.

2.2.3 Diagnostics

# Check divergences, tree depth, energy
rstan::check_hmc_diagnostics(M1$fit)
## 
## Divergences:
## 0 of 4000 iterations ended with a divergence.
## 
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 10.
## 
## Energy:
## E-BFMI indicated no pathological behavior.
# Check rhat and ESS
if(max(rhat(M1), na.rm=T) >= 1.01) {
  print("Warning: Rhat >=1.01")
} else {
  print("All Rhat <1.01")
}
## [1] "All Rhat <1.01"
if(min(neff_ratio(M1), na.rm=T) <= 0.2) {
  print("Warning: ESS <=0.2")
} else {
  print("All ESS >0.2")
}
## [1] "All ESS >0.2"

2.2.4 Posterior predictive check

pp_check(M1, type = "bars", nsamples = 250)

Slight overestimation in the third category and slight underestimations in the first two categories. Shouldn’t be a problem but worth noting.

2.3 \(\mathcal{M}_2\)

A Cumulative model with monotonic predictors.

2.3.1 Prior predictive checks

p <- get_prior(State ~ 1 + prio_s + crit + mo(b_val) + mo(c_val) + sh_s + 
                 kc_s + dep + mo(arch),
               family = cumulative, 
               data = d)

p$prior[1] <- "normal(0,1)"
p$prior[10] <- "normal(0,2)"
p$prior[16:18] <- "dirichlet(2)" # prior for ordered categorical
M2 <- brm(State ~ 1 + prio_s + crit + mo(b_val) + mo(c_val) + sh_s + kc_s + 
            dep + mo(arch),
          family = cumulative, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          control = list(adapt_delta=0.9),
          sample_prior = "only",
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 4 finished in 0.2 seconds.
## Chain 1 finished in 0.3 seconds.
## Chain 2 finished in 0.3 seconds.
## Chain 3 finished in 0.3 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.3 seconds.
## Total execution time: 0.6 seconds.
pp_check(M2, type = "bars", nsamples = 250)

2.3.2 Sample with data

# cumulative model with monotonic predictors
M2 <- brm(State ~ 1 + prio_s + crit + mo(b_val) + mo(c_val) + sh_s + kc_s + 
            dep + mo(arch),
          family = cumulative, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          control = list(adapt_delta=0.9),
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 4 finished in 651.3 seconds.
## Chain 1 finished in 663.5 seconds.
## Chain 3 finished in 716.0 seconds.
## Chain 2 finished in 721.5 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 688.1 seconds.
## Total execution time: 721.9 seconds.

2.3.3 Diagnostics

# Check divergences, tree depth, energy
rstan::check_hmc_diagnostics(M2$fit)
## 
## Divergences:
## 0 of 4000 iterations ended with a divergence.
## 
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 10.
## 
## Energy:
## E-BFMI indicated no pathological behavior.
# Check rhat and ESS
if(max(rhat(M2), na.rm=T) >= 1.01) {
  print("Warning: Rhat >=1.01")
} else {
  print("All Rhat <1.01")
}
## [1] "All Rhat <1.01"
if(min(neff_ratio(M2), na.rm=T) <= 0.2) {
  print("Warning: ESS <=0.2")
} else {
  print("All ESS >0.2")
}
## [1] "All ESS >0.2"

2.3.4 Posterior predictive check

pp_check(M2, type = "bars", nsamples = 250)

2.4 \(\mathcal{M}_3\)

An Adjacent-category model with predictors.

2.4.1 Prior predictive checks

p <- get_prior(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + 
                 dep + arch,
               family = acat, 
               data = d)

p$prior[1] <- "normal(0,1)"
p$prior[10] <- "normal(0,2)"
M3 <- brm(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
          family = acat, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          control = list(adapt_delta=0.9),
          sample_prior = "only",
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.0 seconds.
## Chain 4 finished in 0.1 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.4 seconds.
pp_check(M3, type = "bars", nsamples = 250)

2.4.2 Sample with data

M3 <- brm(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
          family = acat, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          control = list(adapt_delta=0.9),
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 1 finished in 547.7 seconds.
## Chain 4 finished in 558.9 seconds.
## Chain 3 finished in 569.6 seconds.
## Chain 2 finished in 581.4 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 564.4 seconds.
## Total execution time: 582.0 seconds.

2.4.3 Diagnostics

# Check divergences, tree depth, energy
rstan::check_hmc_diagnostics(M3$fit)
## 
## Divergences:
## 0 of 4000 iterations ended with a divergence.
## 
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 10.
## 
## Energy:
## E-BFMI indicated no pathological behavior.
# Check rhat and ESS
if(max(rhat(M3), na.rm=T) >= 1.01) {
  print("Warning: Rhat >=1.01")
} else {
  print("All Rhat <1.01")
}
## [1] "All Rhat <1.01"
if(min(neff_ratio(M3), na.rm=T) <= 0.2) {
  print("Warning: ESS <=0.2")
} else {
  print("All ESS >0.2")
}
## [1] "All ESS >0.2"

2.4.4 Posterior predictive check

pp_check(M3, type = "bars", nsamples = 250)

2.5 \(\mathcal{M}_4\)

A Sequential model with predictors.

2.5.1 Prior predictive checks

p <- get_prior(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + 
                 dep + arch,
               family = sratio, 
               data = d)

p$prior[1] <- "normal(0,1)"
p$prior[10] <- "normal(0,2)"
M4 <- brm(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
          family = sratio, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          control = list(adapt_delta=0.9),
          sample_prior = "only",
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.1 seconds.
## Chain 4 finished in 0.1 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.5 seconds.
pp_check(M4, type = "bars", nsamples = 250)

Here we see a large difference. The sratio family expects a decay. We’ll see if the data will overcome this prior (it should, given that we have \(n=11110\)).

2.5.2 Sample with data

M4 <- brm(State ~ 1 + prio_s + crit + b_val + c_val + sh_s + kc_s + dep + arch,
          family = sratio, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          control = list(adapt_delta=0.9),
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 1 finished in 158.2 seconds.
## Chain 3 finished in 161.4 seconds.
## Chain 2 finished in 161.9 seconds.
## Chain 4 finished in 183.2 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 166.2 seconds.
## Total execution time: 183.6 seconds.

2.5.3 Diagnostics

# Check divergences, tree depth, energy
rstan::check_hmc_diagnostics(M4$fit)
## 
## Divergences:
## 0 of 4000 iterations ended with a divergence.
## 
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 10.
## 
## Energy:
## E-BFMI indicated no pathological behavior.
# Check rhat and ESS
if(max(rhat(M4), na.rm=T) >= 1.01) {
  print("Warning: Rhat >=1.01")
} else {
  print("All Rhat <1.01")
}
## [1] "All Rhat <1.01"
if(min(neff_ratio(M4), na.rm=T) <= 0.2) {
  print("Warning: ESS <=0.2")
} else {
  print("All ESS >0.2")
}
## [1] "All ESS >0.2"

2.5.4 Posterior predictive check

pp_check(M4, type = "bars", nsamples = 250)

2.6 \(\mathcal{M}_5\)

A Sequential model with category-specific predictors.

2.6.1 Prior predictive checks

p <- get_prior(State ~ 1 + prio_s + crit + cs(b_val) + cs(c_val) + sh_s + 
                 kc_s + dep + cs(arch),
               family = sratio, 
               data = d)

p$prior[1] <- "normal(0,1)"
p$prior[10] <- "normal(0,2)"
M5 <- brm(State ~ 1 + prio_s + crit + cs(b_val) + cs(c_val) + sh_s + 
            kc_s + dep + cs(arch),
          family = sratio, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          sample_prior = "only",
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.1 seconds.
## Chain 4 finished in 0.1 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.4 seconds.
pp_check(M5, type = "bars", nsamples = 250)

2.6.2 Sample with data

M5 <- brm(State ~ 1 + prio_s + crit + cs(b_val) + cs(c_val) + sh_s + 
            kc_s + dep + cs(arch),
          family = sratio, 
          data = d, 
          # threads = threading(4), 
          prior = p,
          refresh = 0)
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 2 finished in 635.8 seconds.
## Chain 3 finished in 636.5 seconds.
## Chain 4 finished in 639.1 seconds.
## Chain 1 finished in 645.9 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 639.3 seconds.
## Total execution time: 646.5 seconds.

2.6.3 Diagnostics

# Check divergences, tree depth, energy
rstan::check_hmc_diagnostics(M5$fit)
## 
## Divergences:
## 0 of 4000 iterations ended with a divergence.
## 
## Tree depth:
## 0 of 4000 iterations saturated the maximum tree depth of 10.
## 
## Energy:
## E-BFMI indicated no pathological behavior.
# Check rhat and ESS
if(max(rhat(M5), na.rm=T) >= 1.01) {
  print("Warning: Rhat >=1.01")
} else {
  print("All Rhat <1.01")
}
## [1] "All Rhat <1.01"
if(min(neff_ratio(M5), na.rm=T) <= 0.2) {
  print("Warning: ESS <=0.2")
} else {
  print("All ESS >0.2")
}
## [1] "All ESS >0.2"

2.6.4 Posterior predictive check

pp_check(M5, type = "bars", nsamples = 250)

3 Model comparison

Once we’ve sampled our models we use LOO to compare the models’ relative out of sample prediction capabilities (Vehtari and Gabry, 2017).

(l <- loo_compare(loo(M0), loo(M1), loo(M2), loo(M3), loo(M4), loo(M5)))
##    elpd_diff se_diff
## M5     0.0       0.0
## M4   -43.6      10.8
## M3   -61.4      15.4
## M2  -148.1      20.3
## M1  -148.6      20.2
## M0 -3318.0      63.2

LOO ranks \(\mathcal{M}_5\) first. If we assume a \(z_{\mathrm{99\%}}\)-score of \(2.58\), it’s clear that zero is not in the interval, i.e., \(\mathrm{CI}_{z_{99\%}} = [-71.43, -15.84]\), and that \(\mathcal{M}_5\) has an advantage.
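The interval can be recomputed from the loo_compare() output; with the rounded values printed above, the end points differ slightly from the quoted interval, which uses unrounded estimates:

```r
# elpd_diff +/- z * se_diff with z = 2.58 (~99% interval), using the rounded
# M4-vs-M5 row from the loo_compare() table above.
elpd_diff <- -43.6
se_diff   <- 10.8
elpd_diff + c(-1, 1) * 2.58 * se_diff  # approx. [-71.5, -15.7]
```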

We can also see that adding predictors (\(\mathcal{M}_1\) vs. \(\mathcal{M}_0\)) clearly has a significant effect. However, adding monotonic effects, \(\mathcal{M}_2\), changes very little (i.e., \(\mathcal{M}_1\) vs. \(\mathcal{M}_2\)), while adding category-specific effects, \(\mathcal{M}_5\), definitely improves out-of-sample predictions (i.e., \(\mathcal{M}_5\) vs. \(\mathcal{M}_4\)).

Also worth noting is that among the models with predictors, the Cumulative models (i.e., \(\mathcal{M}_1\) and \(\mathcal{M}_2\)) perform, relatively speaking, the worst. The Sequential models take the first two spots, while the Adjacent-category model falls behind in third place.

If we were interested in refining our model for out-of-sample prediction purposes we could conduct variable selection. However, in this particular case we are interested in each predictor’s effect, so we’ll keep them all, and decide to use \(\mathcal{M}_5\) as our target model, \(\mathcal{M}\), for now.

M <- M5

4 Estimates and effects

4.1 Parameter estimates

If we now focus on our estimates we will see that our Intercepts, which in this case are the borders between two categories in our outcome, are precisely estimated. However, to make sense of them we need to transform the estimates, since the values are on the \(\mathop{\mathrm{logit}}\) scale.

Table 4.1: Population-level estimates (fixed effects) from our model \(\mathcal{M}\). Crosspoints are in italics and significant population-level estimates are in bold. Note that the values are on \(\mathrm{logit}\) scale.
Estimate Est.Error Q2.5 Q97.5
Intercept[1] -0.84 0.07 -0.97 -0.71
Intercept[2] -0.80 0.07 -0.94 -0.66
Intercept[3] 0.43 0.08 0.27 0.59
Intercept[4] -1.14 0.12 -1.37 -0.92
Intercept[5] -3.26 0.25 -3.73 -2.76
prio_s 1.22 0.02 1.18 1.26
crit 0.62 0.05 0.52 0.71
sh_s -0.05 0.02 -0.08 -0.02
kc_s 0.01 0.02 -0.02 0.04
dep 0.09 0.04 0.01 0.18
b_val[1] 0.19 0.03 0.13 0.25
b_val[2] -0.04 0.03 -0.11 0.02
b_val[3] 0.15 0.04 0.06 0.23
b_val[4] -0.18 0.06 -0.30 -0.06
b_val[5] -0.09 0.14 -0.35 0.20
c_val[1] 0.00 0.04 -0.07 0.08
c_val[2] 0.05 0.04 -0.04 0.13
c_val[3] 0.13 0.06 0.02 0.24
c_val[4] 0.19 0.08 0.03 0.34
c_val[5] -0.01 0.16 -0.33 0.30
arch[1] 0.13 0.05 0.03 0.22
arch[2] 0.09 0.05 0.00 0.18
arch[3] 0.03 0.05 -0.06 0.13
arch[4] -0.23 0.06 -0.34 -0.11
arch[5] -0.28 0.12 -0.51 -0.03

Let’s take the first row, Intercept[1], which was estimated to -0.84, i.e.,

inv_logit_scaled(fixef(M)[1,1])
## [1] 0.3007962

What does 0.3 mean? Well, we have an ordered categorical outcome. So 30% of the probability mass, with a 95% credible interval of [0.27, 0.33], was assigned to the first category: Elicited, Dropped. For Intercept[2], 0.31 [0.28, 0.34] was assigned to: Elicited, Dropped and Elicited, Prio, Dropped (i.e., the first two categories combined). Honestly, this is not that interesting; we want to know if the predictors are making a difference, and how large of a difference they make.
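For reference, the same transformation can be reproduced with base R’s plogis(), which is the same inverse logit as inv_logit_scaled() with its default bounds; the cutpoint values are taken from Table 4.1:

```r
# Inverse-logit of the first two (rounded) cutpoints from Table 4.1.
plogis(-0.84)  # ~0.30, mass on category 1 ("Elicited, Dropped")
plogis(-0.80)  # ~0.31, mass on categories 1 and 2 combined
```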

Let’s turn our attention to the other parameters. The prio_s parameter, 1.22, would become 0.77 when transformed. But remember the suffix _s! We need to multiply by \(\sigma_x\) and add \(\bar{x}\) from the data, which leads to an estimate of 689.74 [686.78, 692.61].

The more exotic parameters are our category-specific effects b_val, c_val, and arch. To our knowledge, such effects have never before been used in an analysis in software engineering or computer science. Each ordinal predictor has five category-specific effects estimated (e.g., rows \(11\)–\(15\) in the table above), one per threshold. These estimates indicate to what degree b_val affected each transition between the six ordered categorical levels of State.

If we continue taking b_val as an example, the \(95\)% credible interval does not cross \(0\) for the \(1\)st, \(3\)rd, and \(4\)th parameters (see the table above), being clearly positive in the first two cases and negative in the last. This means that b_val positively affects the intercepts (cutpoints) between the \(1\)st and \(2\)nd categories and between the \(3\)rd and \(4\)th categories, while it negatively affects the intercept between the \(4\)th and \(5\)th categories. These effects would have been impossible to estimate without modeling them separately, as we did in \(\mathcal{M}\).

However, looking at point estimates is, quite frankly, not terribly useful. Let’s plot the posterior probability densities for our population-level estimates on the \(\mathop{\mathrm{logit}}\) scale, disregarding Intercept\([1,\ldots,5]\) (we are, after all, interested in the effect our predictors have on the outcome, not the outcome itself).

Examining the above plot, from bottom to top, we can say that at the \(95\)% level the first three parameters are clearly positive or negative and do not cross \(0\). The fourth parameter, Key customers, is not significant, with a \(95\)% credible interval of [-0.02, 0.04]. The fifth parameter, Dependencies, is significant (positive), with [0.01, 0.18]. For Business value, Customer value, and Architects' involvement some parameters are significant, i.e., for Business value parameters \([1]\), \([3]\), and \([4]\), for Customer value parameters \([3]\) and \([4]\), and for Architects' involvement parameters \([1]\), \([4]\), and \([5]\) (cf. Table 4.1).

To conclude what we’ve noticed so far: Priority, Critical feature, Stakeholders, and Dependencies are significant at \(\mathrm{CI}_{95\%}\), while for Business value, Customer value, and Architects' involvement some categories are significant. Key customers is not significant.

4.2 Conditional effects

Below we plot all conditional effects for our model. The colors represent the different ordered categories, \(1,\ldots,6\), for our outcome State. We are particularly interested in Category \(6\) (pink), i.e., the final category which indicates a released feature.

  1. Elicited, Dropped
  2. Elicited, Prio, Dropped
  3. Elicited, Prio, Planned, Dropped
  4. Elicited, Prio, Planned, Implemented, Dropped
  5. Elicited, Prio, Planned, Implemented, Tested, Dropped
  6. Elicited, Prio, Planned, Implemented, Tested, Released
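Plots of this kind are generated directly by brms. A sketch, assuming the fitted model object is named `M`:

```r
library(brms)

# For an ordinal model, categorical = TRUE plots the posterior probability
# of each outcome category (here, the six States) as a function of each
# predictor in turn. `M` is assumed to be the fitted brms model.
conditional_effects(M, categorical = TRUE)
```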

5 Conclusions

One important question we would like to answer is which independent variable(s) contribute most to a feature ultimately being released: is it priority, criticality, business or customer value, number of stakeholders, number of key customers, having dependencies, and/or the level of architect involvement? The above plots answer this question, without our even having to conduct any statistical tests or examine \(p\)-values.

Concerning Priority, we see that it has a very large effect on State \(6\) (i.e., a feature being released): the higher the priority, the more probability mass is placed on State \(6\). At the high end, close to \(70\)% of the probability mass lies on State \(6\), while the other states are not even close.

Concerning Criticality, we see much the same effect, although the uncertainty also increases. States \(3\) and \(6\) together hold more than \(50\)% of the probability mass. In the Stakeholders plot, however, one thing is clear: State \(6\) has very little probability mass as the number of stakeholders increases, i.e., the more stakeholders, the lower the probability that a feature will be released.

Key customers doesn’t tell us much, due to the very large uncertainty in State \(6\). For Dependencies, not much changes as the number of dependencies increases.

Business value and Customer value play only a small role, judging by the plots above.

Finally, for Architects' involvement we see that the probability of State \(6\) increases when going from None to Simple, but it then remains virtually the same for the remaining categories.

In short, Priority and Criticality have the largest effects, while the rest matter much less. We don’t even need a statistical test to see that.

6 References

Agresti, A., 2010. Analysis of ordinal categorical data. John Wiley & Sons.
Bürkner, P.-C., Charpentier, E., 2020. Modelling monotonic effects of ordinal predictors in Bayesian regression models. British Journal of Mathematical and Statistical Psychology 73, 420–451. https://doi.org/10.1111/bmsp.12195
Bürkner, P.-C., Vuorre, M., 2019. Ordinal regression models in psychology: A tutorial. Advances in Methods and Practices in Psychological Science 2, 77–101. https://doi.org/10.1177/2515245918823199
Samejima, F., 1997. Graded response model, in: Handbook of Modern Item Response Theory. Springer, pp. 85–100.
Tutz, G., 1990. Sequential item response models with an ordered response. British Journal of Mathematical and Statistical Psychology 43, 39–55. https://doi.org/10.1111/j.2044-8317.1990.tb00925.x
Vehtari, A., Gelman, A., Gabry, J., 2017. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing 27, 1413–1432. https://doi.org/10.1007/s11222-016-9696-4

7 Appendix

7.1 Computation time

end.time <- Sys.time()
round((end.time - start.time), 1)
## Time difference of 1.4 mins

7.2 Environment

print(sessionInfo(), locale=FALSE)
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] patchwork_1.1.1  kableExtra_1.3.4 forcats_0.5.1    stringr_1.4.0   
##  [5] dplyr_1.0.4      purrr_0.3.4      readr_1.4.0      tidyr_1.1.2     
##  [9] tibble_3.0.6     tidyverse_1.3.0  bayesplot_1.8.0  brms_2.14.11    
## [13] Rcpp_1.0.6       ggthemes_4.2.4   ggplot2_3.3.3    openxlsx_4.2.3  
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1            backports_1.2.1         systemfonts_1.0.1      
##   [4] plyr_1.8.6              igraph_1.2.6            splines_4.0.4          
##   [7] crosstalk_1.1.1         rstantools_2.1.1        inline_0.3.17          
##  [10] digest_0.6.27           htmltools_0.5.1.1       rsconnect_0.8.16       
##  [13] magrittr_2.0.1          modelr_0.1.8            RcppParallel_5.0.2     
##  [16] matrixStats_0.58.0      xts_0.12.1              svglite_2.0.0          
##  [19] prettyunits_1.1.1       colorspace_2.0-0        rvest_0.3.6            
##  [22] haven_2.3.1             xfun_0.21               callr_3.5.1            
##  [25] crayon_1.4.1            jsonlite_1.7.2          lme4_1.1-26            
##  [28] zoo_1.8-8               glue_1.4.2              gtable_0.3.0           
##  [31] webshot_0.5.2           V8_3.4.0                pkgbuild_1.2.0         
##  [34] rstan_2.26.0.9000       abind_1.4-5             scales_1.1.1           
##  [37] mvtnorm_1.1-1           DBI_1.1.1               miniUI_0.1.1.1         
##  [40] viridisLite_0.3.0       xtable_1.8-4            stats4_4.0.4           
##  [43] StanHeaders_2.26.0.9000 DT_0.17                 htmlwidgets_1.5.3      
##  [46] httr_1.4.2              threejs_0.3.3           ellipsis_0.3.1         
##  [49] farver_2.0.3            pkgconfig_2.0.3         loo_2.4.1              
##  [52] sass_0.3.1              dbplyr_2.1.0            labeling_0.4.2         
##  [55] tidyselect_1.1.0        rlang_0.4.10            reshape2_1.4.4         
##  [58] later_1.1.0.1           munsell_0.5.0           cellranger_1.1.0       
##  [61] tools_4.0.4             cli_2.3.0               generics_0.1.0         
##  [64] broom_0.7.5             ggridges_0.5.3          evaluate_0.14          
##  [67] fastmap_1.1.0           yaml_2.2.1              processx_3.4.5         
##  [70] knitr_1.31              fs_1.5.0                zip_2.1.1              
##  [73] nlme_3.1-152            mime_0.10               projpred_2.0.2         
##  [76] xml2_1.3.2              compiler_4.0.4          shinythemes_1.2.0      
##  [79] rstudioapi_0.13         curl_4.3                gamm4_0.2-6            
##  [82] reprex_1.0.0            statmod_1.4.35          bslib_0.2.4            
##  [85] stringi_1.5.3           highr_0.8               ps_1.5.0               
##  [88] Brobdingnag_1.2-6       lattice_0.20-41         Matrix_1.3-2           
##  [91] nloptr_1.2.2.2          markdown_1.1            shinyjs_2.0.0          
##  [94] vctrs_0.3.6             pillar_1.4.7            lifecycle_1.0.0        
##  [97] jquerylib_0.1.3         bridgesampling_1.0-0    httpuv_1.5.5           
## [100] R6_2.5.0                bookdown_0.21           promises_1.2.0.1       
## [103] gridExtra_2.3           codetools_0.2-18        boot_1.3-27            
## [106] colourpicker_1.1.0      MASS_7.3-53.1           gtools_3.8.2           
## [109] assertthat_0.2.1        rprojroot_2.0.2         withr_2.4.1            
## [112] shinystan_2.5.0         mgcv_1.8-34             parallel_4.0.4         
## [115] hms_1.0.0               grid_4.0.4              coda_0.19-4            
## [118] minqa_1.2.4             cmdstanr_0.3.0.9000     rmarkdown_2.7          
## [121] shiny_1.6.0             lubridate_1.7.9.2       base64enc_0.1-3        
## [124] dygraphs_1.1.1.6